Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization
نویسندگان
چکیده
منابع مشابه
UniqueProt: creating representative protein sequence sets
UniqueProt is a practical and easy to use web service designed to create representative, unbiased data sets of protein sequences. The largest possible representative sets are found through a simple greedy algorithm using the HSSP-value to establish sequence similarity. UniqueProt is not a real clustering program in the sense that the 'representatives' are not at the centres of well-defined clus...
متن کاملComplexity of choosing subsets from color sets
We raise and investigate the algorithmic complexity of the following problem. Given a graph G = (V; E) and p-element sets L(v) for its vertices v 2 V such that jL(u) L(v)j p + r for all edges uv 2 E, do there exist q-element subsets C (v) L(v) with C (u) \ C (v) = ; for all uv 2 E ? Here p; q; r are positive integers, p q and p + r 2q. We characterize precisely which triples (p; q; r) admit a p...
متن کاملEliminating redundancy among protein sequences using submodular optimization
Motivation: Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success in many fields but is not yet widely used in biology. We apply submodular optimization to the problem of removing redundancy in protein sequence data sets. This is a common step in many bioinformatics and structural biology workflows, including creation of non-redundant t...
متن کاملSelection of representative protein data sets.
The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to ex...
متن کاملMining Non-Redundant Sets of Generalizing Patterns from Sequence Databases
Sequential pattern mining techniques extract patterns corresponding to frequent subsequences from a sequence database. A practical limitation of these techniques is that they overload the user with too many patterns. Local Process Model (LPM) mining is an alternative approach coming from the field of process mining. While in traditional sequential pattern mining, a pattern describes one subsequ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Proteins: Structure, Function, and Bioinformatics
سال: 2018
ISSN: 0887-3585
DOI: 10.1002/prot.25461